Introduction

Simple random selection of training and testing folds in the structured environment leads to an underestimation of error in the evaluation of spatial predictions and may result in inappropriate model selection (Telford and Birks, 2009; Roberts et al., 2017). The use of spatial and environmental blocks to separate training and testing sets has been suggested as a good strategy for realistic error estimation in datasets with dependence structures, and more generally as a robust method for estimating the predictive performance of models used to predict mapped distributions (Roberts et al., 2017). Package blockCV provides functions to separate train and test sets using buffers, spatial and clustering blocks (spatial and environmental), i.e. generates folds for k-fold and leave-one-out (LOO) cross-validation. It provides several options for how those blocks are constructed. It also has a function that applies geostatistical techniques to investigate the existing level of spatial autocorrelation in the response or covariates to inform the choice of a suitable distance band by which to separate the data sets. The package has been written with species distribution modelling (SDM) in mind, and the functions allow for a number of common scenarios (including presence-absence and presence-background species data, rare and common species, raster data for predictor variables). Although it can be applied to any spatial modelling e.g. multi-class responses for remote sensing image classification.

Please cite blockCV by: Valavi R, Elith J, Lahoz-Monfort JJ, Guillera-Arroita G. blockCV: An R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods Ecol Evol. 2019; 10:225–232. doi: 10.1111/2041-210X.13107

New updates of the version 3.0

There are major updates in blockCV v3.0. All function names have changed to more general names starting with cv_*. The old functions (v2.x) still work, but they will be removed in future versions. Please update your code with the new functions presented bellow.

Some new updates:

  • The three main blocking functions are now: cv_spatial, cv_cluster, and cv_buffer
  • Spatial blocks now support hexagonal (default), rectangular, and user-defined blocks
  • Clustering function now works both on environmental rasters, and spatial coordinates of the sample points
  • The cv_spatial_autocor function now calculates spatial autocorrelation range for either the response (i.e. the binary or continuous data) or a set of continuous raster covariates (as before)
  • The new cv_plot function can be used to plot the folds of all blocking strategy with ggplot facets
  • The terra package is now used for all raster processing with support for stars and raster objects (or files on disk)
  • The new cv_similarity provides measures of possible extrapolation to testing folds

Installation

The blockCV is available in CRAN and the latest update can also be downloaded from GitHub. It is recommended to install the dependencies of the package. To install the package use:

# install stable version from CRAN
install.packages("blockCV", dependencies = TRUE)

# install latest update from GitHub
remotes::install_github("rvalavi/blockCV", dependencies = TRUE)
# loading the package
library(blockCV)

Package data

The package contains the raw format of the following data:

  • Raster covariates of Australia (.tif)
  • Simulated species data (.csv)

These data are used to illustrate how the package is used. The raster data include several bioclimatic variables for Australia. The species data include presence-absence records (binary) of a simulated species.

To load the package raster data use:

library(sf) # working with spatial vector data
library(terra) # working with spatial raster data
library(tmap) # plotting block and maps

# load raster data
# the pipe operator |> is available for R >= 4.1
rasters <- system.file("extdata/au/", package = "blockCV") |>
  list.files(full.names = TRUE) |>
  terra::rast()

The presence-absence species data include 481 presence points and 514 absence points.

# load species presence-asence data and convert to sf
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
head(points)
##           x        y occ
## 1 1168236.6 -3824513   0
## 2  757436.3 -3927213   0
## 3 1767320.5 -3679021   1
## 4 1852903.9 -3251104   1
## 5 1005628.2 -4106938   1
## 6 1416428.5 -4260988   1

The appropriate format of species data for the blockCV package is simple features (from the sf package). The data is provide in GDA2020 / GA LCC coordinate reference system with "EPSG:7845" as defined by crs = 7845. We convert the data.frame to sf as follows:

pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)

Let’s plot the species data using tmap package:

tm_shape(rasters[[1]]) +
  tm_raster(legend.show = FALSE, n = 30, palette = gray.colors(10)) +
  tm_shape(pa_data) +
  tm_dots(col = "occ", style = "cat", size = 0.1)

Block cross-validation strategies

Spatial blocks

The function cv_spatial creates spatial blocks/polygons then assigns blocks to the training and testing folds with random, checkerboard pattern or in a systematic selection. Spatial blocks can be defined either by size or number of rows and columns.

Consistent with other functions, the distance (size) should be in metres, regardless of the unit of the reference system of the input data. When the input map has geographic coordinate system (decimal degrees), the block size is calculated based on dividing size by 111325 (the standard distance of a degree in metres, on the Equator) to change metre to degree. This value can be changed by the user via the deg_to_metre argument.

The offset can be used to shift the spatial position of the blocks in horizontal and vertical axes, respectively. This only works when the block have been built based on size. The blocks argument allows users to define an external spatial polygon as blocking layer.

Here are some spatial block settings:

sb1 <- cv_spatial(x = pa_data,
                  column = "occ", # the response column (binary or multi-class)
                  k = 5, # number of folds
                  size = 350000, # size of the blocks in metres
                  selection = "random", # random blocks-to-fold
                  iteration = 50, # find evenly dispersed folds
                  biomod2 = TRUE) # also create folds for biomod2

The same setting can be used to create square blocks. You can optionally add a raster layer for to be used for creating blocks and be used in the background of the plot.

sb2 <- cv_spatial(x = pa_data,
                  column = "occ",
                  r = rasters, # optional
                  k = 5,
                  size = 350000,
                  hexagon = FALSE,
                  selection = "random",
                  iteration = 50,
                  progress = FALSE,
                  biomod2 = TRUE)
## 
##   train_0 train_1 test_0 test_1
## 1     438     399     76     82
## 2     412     335    102    146
## 3     414     383    100     98
## 4     426     399     88     82
## 5     366     408    148     73

The assignment of folds to each block can also be done in a systematic manner using selection = "systematic".

sb3 <- cv_spatial(x = pa_data,
                  column = "occ",
                  k = 5,
                  size = 350000,
                  hexagon = FALSE,
                  selection = "systematic",
                  iteration = 50)
## 
##   train_0 train_1 test_0 test_1
## 1     417     339     97    142
## 2     334     391    180     90
## 3     468     388     46     93
## 4     432     426     82     55
## 5     405     380    109    101

Or a checkerboard pattern using selection = "checkerboard".

sb4 <- cv_spatial(x = pa_data,
                  column = "occ",
                  size = 350000,
                  hexagon = FALSE,
                  selection = "checkerboard",
                  iteration = 50)
## 
##   train_0 train_1 test_0 test_1
## 1     241     306    273    175
## 2     273     175    241    306

Let’s visualise the checkerboard blocks with tmap:

tm_shape(sb4$blocks) +
  tm_fill(col = "folds", style = "cat")

The blocks can also be created by number of rows and columns:

sb5 <- cv_spatial(x = pa_data,
                  column = "occ",
                  r = rasters,
                  k = 5, 
                  rows_cols = c(10, 2), # for hexagonal only first one is used
                  hexagon = FALSE,
                  selection = "random",
                  iteration = 50,
                  progress = FALSE)
## 
##   train_0 train_1 test_0 test_1
## 1     436     328     78    153
## 2     447     450     67     31
## 3     405     409    109     72
## 4     349     337    165    144
## 5     419     400     95     81

Spatial and environemntal clustering

The function cv_cluster uses clustering methods to specify sets of similar environmental conditions based on the input covariates. Species data corresponding to any of these groups or clusters are assigned to a fold. Alternatively, the clusters can be based on spatial coordinates of sample points (the x argument).

Using spatial coordinate values for clustering:

# spatial clustering
scv <- cv_cluster(x = pa_data,
                  column = "occ", # optional: counting number of train/test records
                  k = 5)
## Warning in cv_cluster(x = pa_data, column = "occ", k = 5): Fold 1 has class(es)
## with zero records
##   train_0 train_1 test_0 test_1
## 1     490     481     24      0
## 2     414     456    100     25
## 3     318     293    196    188
## 4     378     244    136    237
## 5     456     450     58     31

The clustering can be done in environmental space by supplying r. Notice, this could be an extreme case of cross-validation as the testing folds could possibly fall in novel environmental conditions than what the training points are (check cv_similarity for testing this). Note that the input raster layer should cover all the species points, otherwise an error will rise. The records with no raster value should be deleted prior to the analysis.

# environmental clustering
ecv <- cv_cluster(x = pa_data,
                  column = "occ",
                  r = rasters,
                  k = 5, 
                  scale = TRUE)
##   train_0 train_1 test_0 test_1
## 1     494     337     20    144
## 2     431     467     83     14
## 3     287     419    227     62
## 4     429     269     85    212
## 5     415     432     99     49

When r is supplied, all the input rasters are first centred and scaled to avoid one raster variable dominate the clusters.

By default, the clustering will be done based only on the values of the predictors at the sample points. In this case, and the number of the folds will be the same as k. If raster_cluster = TRUE, the clustering is done in the raster space. In this approach, the clusters will be consistent throughout the region and across species (in the same region). However, this may result in cluster(s) that cover none of the species records especially when species data is not dispersed throughout the region (or environmental ranges) or the number of clusters (k or folds) is high.

Buffering LOO (also known as Spatial LOO)

The function cv_buffer generates spatially separated training and testing folds by considering buffers of specified distance around each observation point. This approach is a form of leave-one-out (LOO) cross-validation. Each fold is generated by excluding nearby observations around each testing point within the specified distance (ideally the range of spatial autocorrelation). In this method the test set never directly abuts a training set.

Using buffering to create CV folds:

# buffering with presence-background data
bloo <- cv_buffer(x = pa_data,
                  column = "occ",
                  size = 400000)

For species presence-absence data and any other types of data (such as continuous, counts, and multi-class targets) keep presence_background = FALSE. In this case, all sample points other than the target point within the buffer are excluded, and the training set comprises all points outside the buffer.

When using species presence-background data (or presence and pseudo-absence), you need to supply the column and set presence_background = TRUE. In this case, only presence points are considered as target points. For more information read the details section in the help of the function (i.e. help(cv_buffer)).

Visualising the folds

You can visualise the generate folds for all block cross-validation strategies. You can optionally add a raster layer for background maps using r option. When r is supplied the plots might be slightly slower.

Let’s plot spatial clustering folds created in previous section (using cv_cluster):

cv_plot(cv = scv,
        x = pa_data, 
        r = rasters) # optionally add a raster background

When cv_buffer is used for plotting, only first 10 folds are shown. You can choose any set of CV folds for plotting. If remove_na = FALSE (default is TRUE), the NA in legend shows the excluded points.

cv_plot(cv = bloo,
        x = pa_data,
        num_plots = c(1, 50, 100)) 

If you do not supply x when plotting a cv_spatial object, only the spatial blocks are plotted.

cv_plot(cv = sb1,
        r = rasters,
        raster_colors = terrain.colors(10, alpha = 0.5),
        label_size = 4) 

Check similarity

The cv_similarity function can check for environmental similarity between the training and testing folds and thus possible extrapolation. It computes multivariate environmental similarity surface (MESS) as described in Elith et al. (2010). MESS represents how similar a point in a testing fold is to a training fold (as a reference set of points), with respect to a set of predictor variables in r. The negative values are the sites where at least one variable has a value that is outside the range of environments over the reference set, so these are novel environments.

cv_similarity(cv = sb1, 
              x = pa_data, 
              r = rasters, 
              progress = FALSE)

This could be used to see the amount of di

cv_similarity(cv = ecv, 
              x = pa_data, 
              r = rasters, 
              progress = FALSE)

Estimating size: the effective range of spatial autocorrelation

To support a first choice of block size, prior to any model fitting, package blockCV includes the option for the user to look at the existing autocorrelation in the response or predictors (as an indication of landscape spatial structure). This tool does not suggest any absolute solution to the problem, but serves as a guide to the user. It provides information about the effective range of spatial autocorrelation which is the range over which observations are independent.

When only r is supplied, the cv_spatial_autocor function works by automatically fitting variograms to each continuous raster and finding the effective range of spatial autocorrelation. Variogram is a fundamental geostatistical tool for measuring spatial autocorrelation (O’Sullivan and Unwin, 2010).

sac1 <- cv_spatial_autocor(r = rasters, 
                           num_sample = 5000)

The plotted block size is based on the median of the spatial autocorrelation ranges. This could be as the minimum block size for creating spatially separated folds. Variograms are computed taking a number of random points (5000 as default) from each input raster file. The variogram fitting procedure is done using automap package (Hiemstra et al., 2009), using the isotropic variogram and assuming the data meets the geostatistical criteria e.g. stationarity.

The output object of this function is an cv_spatial_autocor object, an object of class S3.

# class of the output result
class(sac1)
## [1] "cv_spatial_autocor"

To see the details of the fitted variograms:

# summary statistics of the output
summary(sac1)
## Summary statistics of spatial autocorrelation ranges of all input layers:
## Length  Class   Mode 
##      0   NULL   NULL 
## NULL

Alternatively, only use the response data using x and column. This could be a binary or continuous variable provided in as a column in the sample points sf object.

sac2 <- cv_spatial_autocor(x = pa_data, 
                           column =  "occ")

To visualise them (this needs the automap package to be loaded):

library(automap)
## Loading required package: sp
plot(sac2$variograms[[1]])

Package blockCV also provides a visualisation tool for assisting in block size selection. This tool is developed as local web applications using R package shiny. With cv_block_size, the user can choose among block sizes in a specified range, visualise the resulting blocks interactively, viewing the impact of block size on number and arrangement of blocks in the landscape (and optionally on the distribution of sample points in those blocks).

Using only raster data:

cv_block_size(r = rasters)

Or use only spatial sample data:

cv_block_size(x = pa_data,
              column = "occ") # optionally add the response

Or add both raster and samples (also define a min/max size):

cv_block_size(x = pa_data,
              column = "occ",
              r = rasters,
              min_size = 2e5,
              max_size = 9e5)

Note that the interactive plots cannot be shown here, as they require opening an external window or web browser. When using cv_block_size, slide to the selected block size, and click Apply Changes to change the block size.

References:

  • Hiemstra, P. H., Pebesma, E. J., Twenhöfel, C. J., & Heuvelink, G. B. (2009). Real-time automatic interpolation of ambient gamma dose rates from the Dutch radioactivity monitoring network. Computers & Geosciences, 35(8), 1711–1721.

  • O’Sullivan, D., & Unwin, D. J. (2010). Geographic Information Analysis (2nd ed.). John Wiley & Sons.

  • Roberts, D.R., Bahn, V., Ciuti, S., Boyce, M.S., Elith, J., Guillera-Arroita, G., Hauenstein, S., Lahoz-Monfort, J.J., Schröder, B., Thuiller, W., others, 2017. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography. 40: 913-929.

  • Telford, R. J., & Birks, H. J. B. (2009). Evaluation of transfer functions in spatially structured environments. Quaternary Science Reviews, 28(13), 1309–1316.

  • Valavi R, Elith J, Lahoz-Monfort JJ, Guillera-Arroita G. blockCV: An R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods Ecol Evol. 2019; 10:225–232. doi: 10.1111/2041-210X.13107